This document summarises the data from https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081. The study, by Michael Hall from Zam’s group, describes the development of DRPRG, a tool for drug resistance prediction using genome graphs.

In doing so, Michael collated and carefully(!) curated an enriched WHO dataset, which is summarised here.

Summary:

A breakdown of R/S phenotypes for the samples can be seen in the plot below:

library(dplyr)
library(tidyr)
library(ggplot2)

# Count R and S calls per antibiotic, then plot them as dodged bars.
pheno_data %>%
  select(-run, -bioproject, -biosample) %>%
  # Reshape to one row per sample/antibiotic call.
  pivot_longer(everything(), names_to = "antibiotic", values_to = "phenotype") %>%
  # Keep only R/S calls; this also drops NA (untested) entries, which would
  # otherwise turn a plain sum(. == "R") into NA.
  filter(phenotype %in% c("R", "S")) %>%
  count(antibiotic, phenotype, name = "value") %>%
  # Restore zero counts so each plotted antibiotic has both an R and an S bar.
  complete(antibiotic, phenotype, fill = list(value = 0L)) %>%
  mutate(antibiotic = if_else(antibiotic == "para-aminosalicylic_acid", "PAS", antibiotic)) %>%
  ggplot(aes(x = antibiotic, y = value)) +
  geom_col(aes(fill = phenotype), position = "dodge") +
  geom_text(
    aes(label = value, group = phenotype),
    vjust = -1,
    position = position_dodge(width = 0.9),
    size = 2.5
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)) +
  ylab("# samples") +
  ggtitle("WHO-enriched dataset, DRPRG")

How was this data created?

According to the README.md in the data source, the phenotype data can be found in config/illumina.samplesheet.csv.
It was compiled by summarising the experiments found in config/samplesheets/ using workflow/notebook/notepad.ipynb. This notebook is large and includes various analyses, but the one we are interested in is “Data Collation”.
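For reference, the `pheno_data` object used in the plot above could be loaded from that samplesheet roughly as follows. This is a minimal sketch, not the code used in the study: `load_pheno` is a hypothetical helper, the toy file and the `isoniazid` column are invented, and the layout (three ID columns followed by one column per drug) is an assumption inferred from the plotting code.

```r
# Hypothetical loader; the column layout is assumed from the plotting code,
# which drops `run`, `bioproject` and `biosample` and treats the rest as drugs.
load_pheno <- function(path) {
  read.csv(
    path,
    check.names = FALSE,        # keep names such as "para-aminosalicylic_acid" intact
    na.strings = c("", "NA"),   # assumption: untested drugs are blank or NA
    stringsAsFactors = FALSE
  )
}

# Toy file standing in for config/illumina.samplesheet.csv (values are made up).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "run,bioproject,biosample,isoniazid,para-aminosalicylic_acid",
  "RUN1,PRJ1,SAM1,R,",
  "RUN2,PRJ1,SAM2,S,S"
), toy_path)

pheno_data <- load_pheno(toy_path)
str(pheno_data)
```

Setting `check.names = FALSE` matters here: with the default, `read.csv` would rename the hyphenated column to `para.aminosalicylic_acid`, and the PAS relabelling in the plot would silently stop matching.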

The first steps involved creating a WHO base dataset from two data sources:

  1. “gentb”, which is labelled as WHO correspondence (either a subset or a superset of source 2).

  2. “WHO”, which I’m assuming is from the latest catalogue.

These were followed by extensive data cleaning: discrepancies between the two datasets were resolved and extra metadata was added using code from https://github.com/mbhall88/WHO-correspondence/blob/main/docs/fill_in_who_samplesheet.py.
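To give a feel for what resolving discrepancies between two phenotype sources can involve, here is a small sketch in R. It is not the logic of the linked Python script, which I have not reproduced. The sample IDs, the `isoniazid` column, and the policy of blanking out discordant calls are all assumptions made for illustration.

```r
library(dplyr)

# Toy stand-ins for the two sources; sample IDs and calls are invented.
gentb <- data.frame(biosample = c("SAM1", "SAM2", "SAM3"),
                    isoniazid = c("R", "S", NA))
who   <- data.frame(biosample = c("SAM2", "SAM3", "SAM4"),
                    isoniazid = c("R", "S", "S"))

merged <- full_join(gentb, who, by = "biosample", suffix = c("_gentb", "_who")) %>%
  mutate(
    # Discordant only when both sources report a call and the calls differ.
    discordant = !is.na(isoniazid_gentb) & !is.na(isoniazid_who) &
      isoniazid_gentb != isoniazid_who,
    # Keep whichever call exists; blank out discordant calls for manual review
    # (an illustrative policy, not necessarily the one used in the study).
    isoniazid = if_else(discordant, NA_character_,
                        coalesce(isoniazid_gentb, isoniazid_who))
  )

merged
```

A full join keeps samples that appear in only one source, which is what you want when one dataset may be a subset or a superset of the other.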

Next, data from various publications was sequentially added and cleaned:

| dataset | publication | pheno_method |
|---|---|---|
| gentb | WHO correspondence | various |
| WHO | https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(21)00301-3/fulltext | various |
| trisakil | https://doi.org/10.1080/22221751.2022.2099304 | various |
| Smith | https://pubmed.ncbi.nlm.nih.gov/33055186/ | liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method |
| peker | https://doi.org/10.1099/mgen.0.000695 | incredibly, not noted; the only reference in the methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)' |
| merker | https://doi.org/10.1038/s41467-022-32455-1 | various |
| finci | https://pubmed.ncbi.nlm.nih.gov/35907429/ | BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported) |
| leah_bdq | https://doi.org/10.1101/2022.12.08.519610 | BACTEC MGIT 960 DST |
| marco_pheno | https://doi.org/10.3389/fmicb.2023.1104456 | not stated, just that they were 'according to WHO classification' |
| lempens_acc | https://doi.org/10.1016/j.ijid.2020.08.042 | LJ slopes and 7H11 plates, proportional |

Then, the following was noted: “Get the BioProject of all BioSamples with antibiogram data in NCBI. Once I have the BioProject, I can …..download the antibiogram table”.

This added a further 1073 samples to the superset, each with a phenotype for at least one drug.

| bioproj | pheno_method_2 |
|---|---|
| PRJNA353873 | MGIT, MICs listed |
| PRJNA413593 | MGIT, MICs listed |
| PRJNA438921 | MGIT, MICs listed |
| PRJNA557083 | MGIT, MICs listed |
| PRJNA650381 | MGIT and proportional agar, MICs listed for both |
| PRJNA663350 | MGIT, MICs listed |
| PRJNA717333 | 96 well plate, MICs listed |
| PRJNA824124 | MGIT, MICs listed |
| PRJNA834625 | LJ slopes, MICs listed |
| PRJNA888434 | MGIT, MICs listed |
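
The "phenotype for at least one drug" criterion used when adding the antibiogram samples can be expressed as a simple filter. The sketch below uses an invented three-sample table. The column names are assumptions, and the original notebook may well have applied the criterion differently.

```r
library(dplyr)

# Invented antibiogram rows; a real table would have one column per drug tested.
antibiogram <- data.frame(
  biosample  = c("SAM1", "SAM2", "SAM3"),
  isoniazid  = c("R", NA, NA),
  rifampicin = c(NA, NA, "S")
)

# Keep samples with a non-missing call in any drug column.
with_pheno <- antibiogram %>%
  filter(if_any(-biosample, ~ !is.na(.x)))

nrow(with_pheno)
```

Here SAM2 is dropped because it has no call for any drug, while SAM1 and SAM3 are kept even though each has only one.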